
    Design of Adiabatic MTJ-CMOS Hybrid Circuits

    Low-power designs are a necessity with the increasing demand for battery-operated portable devices. In many such devices, battery life matters more than operational speed. Logic-in-memory structures using nano-devices and adiabatic designs are two methods to reduce static and dynamic power consumption, respectively. The magnetic tunnel junction (MTJ) is an emerging technology with many advantages when used in logic-in-memory structures in conjunction with CMOS. In this paper, we introduce a novel adiabatic hybrid MTJ/CMOS structure, which we use to design AND/NAND, XOR/XNOR, and 1-bit full adder circuits. We simulate the designs using HSPICE with 32nm CMOS technology and compare them with non-adiabatic hybrid MTJ/CMOS circuits. The proposed adiabatic MTJ/CMOS full adder design has more than 7 times lower power consumption than the previous MTJ/CMOS full adder.
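
    The power advantage of adiabatic logic comes from charging capacitive loads gradually from a ramped power-clock rather than abruptly from a fixed supply. As a minimal sketch of this first-order argument (not of the paper's HSPICE results), the snippet below compares the standard energy expressions for conventional and adiabatic switching; all component values are hypothetical.

```python
# Illustrative comparison of conventional vs. adiabatic switching energy.
# All component values are hypothetical, not taken from the paper.

def conventional_energy(c_load, vdd):
    """Energy dissipated per abrupt switching event: E = 1/2 * C * Vdd^2."""
    return 0.5 * c_load * vdd ** 2

def adiabatic_energy(c_load, vdd, r_on, ramp_time):
    """Energy dissipated when charging through resistance R over a ramp
    time T >> RC: E ~= (RC / T) * C * Vdd^2, which drops below
    1/2 * C * Vdd^2 once T > 2 * RC."""
    return (r_on * c_load / ramp_time) * c_load * vdd ** 2

C = 10e-15   # 10 fF load (hypothetical)
V = 0.9      # 0.9 V supply, plausible for 32 nm (assumed)
R = 10e3     # 10 kOhm effective on-resistance (hypothetical)
T = 10e-9    # 10 ns power-clock ramp (hypothetical)

print(f"conventional: {conventional_energy(C, V):.3e} J")
print(f"adiabatic:    {adiabatic_energy(C, V, R, T):.3e} J")
```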

    EVALUATING THE IMPACT OF MEMORY SYSTEM PERFORMANCE ON SOFTWARE PREFETCHING AND LOCALITY OPTIMIZATIONS

    Software prefetching and locality optimizations are two techniques for overcoming the speed gap between processor and memory, known as the memory wall (Wulf and McKee). This thesis evaluates the impact of memory trends on the effectiveness of software prefetching and locality optimizations for three types of applications: regular scientific codes, irregular scientific codes, and pointer-chasing codes. For many applications, software prefetching outperforms locality optimizations when the underlying memory system provides sufficient bandwidth, while locality optimizations win when it does not. The break-even point, or crossover bandwidth, occurs at roughly 2.4 GBytes/sec for 1 GHz processors on today's memory systems, and will increase on future memory systems. This thesis also studies the interactions between software prefetching and locality optimizations when applied in concert. Naively combining the two techniques yields application performance that is more robust to variations in memory bandwidth and latency, but provides no additional gains: performance does not exceed the better of the two techniques alone. Several algorithms are also proposed and evaluated to better combine software prefetching and locality optimizations, including an enhanced tiling algorithm, padding for software prefetching, and index prefetching. (Also UMIACS-TR-2002-72)
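
    Loop tiling (blocking) is representative of the locality optimizations the thesis evaluates: it restructures loops so a working set is reused while still cache-resident, which is what lets it compete with prefetching when bandwidth is scarce. A minimal illustrative sketch, not taken from the thesis:

```python
# Minimal illustration of loop tiling (blocking); the tile size plays the
# role of the partition that must fit in cache.
import numpy as np

def matmul_tiled(a, b, tile=64):
    """Blocked matrix multiply: each (tile x tile) block of A, B, and C is
    reused while still resident in cache, improving temporal locality."""
    n = a.shape[0]
    c = np.zeros((n, n))
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                c[ii:ii+tile, jj:jj+tile] += (
                    a[ii:ii+tile, kk:kk+tile] @ b[kk:kk+tile, jj:jj+tile]
                )
    return c

a = np.random.rand(256, 256)
b = np.random.rand(256, 256)
assert np.allclose(matmul_tiled(a, b), a @ b)
```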

    Locality Transformations and Prediction Techniques for Optimizing Multicore Memory Performance

    Chip Multiprocessors (CMPs) are here to stay for the foreseeable future. In terms of programmability, what distinguishes them from legacy multiprocessors is that sharing among the different cores is less expensive than it was in the past. Previous research suggested that sharing is a desirable feature to incorporate in new codes; for some programs, larger on-chip caches make sharing increasingly beneficial. This work asks whether we can (and should) write code differently when the underlying microarchitecture is a Chip Multiprocessor. We use a set of three graph benchmarks, each with three input problems varying in size and connectivity, to characterize how partitioning the problem space among cores, at multiple levels of the cache hierarchy, improves performance: good utilization of the lowest-level caches combines with increased sharing of data items at the shared cache level (L2 in our case), which effectively acts as prefetching among the compute cores.

    The thesis has two thrusts. The first explores the design space formed by different parallelization schemes (we devise some tweaks on top of existing techniques) and different graph partitionings (a locality optimization technique suited to graph problems). The combination of parallelization strategy and graph partitioning yields a large and complex space, which we characterize using detailed simulation to measure the gain over a baseline legacy parallelization with a partition sized to fit in the L1 cache. We show that the legacy parallelization is not the best alternative in most cases and that other parallelization techniques perform better. We also show that determining the partitioning size is itself a search problem, and that in most cases the best partitioning size is smaller than the baseline partition size.

    The second thrust explores predicting the best combination of parallelization and partitioning for any given benchmark and input data set. We use a PIN-based reuse distance profiling tool to build an execution time prediction model that rank-orders the combinations of parallelization strategies and partitioning sizes, and we report how much of the gain identified by detailed simulation the PIN-based prediction captures. In some cases the prediction is 100% accurate; in others it projects worse performance than the baseline. We report the gap between the best combination found by simulation and the PIN-predicted one, along with other statistics that evaluate prediction quality. The PIN-based method predicts the partition size much better than it predicts the parallelization strategy, so overall accuracy improves substantially if we take only the partition size from the PIN prediction and then use a search strategy to find the best parallelization strategy for that partition size.

    In this thesis, we use a detailed performance model to scan a large solution space for the best locality optimization parameters for a set of graph problems. Using M5 performance simulation, we show gains of up to 20% over a naively picked baseline. Our prediction scheme achieves up to 100% of the best performance gain obtainable via a search method on real hardware or performance simulation, without running on the target hardware at all, and up to 48% on average across all of our benchmarks and input sizes. Several aspects of this work are novel: we are the first to devise and verify such a performance model against detailed simulation results; we show and quantify that locality optimization and problem partitioning can increase sharing synergistically to achieve better overall performance; and we demonstrate a new use for coherent reuse distance profiles as a tool that helps program developers and compilers optimize a program's performance.
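
    Reuse distance, the signal behind the PIN-based predictor, is the number of distinct addresses touched between two accesses to the same address; accesses whose distance falls below the cache capacity hit. A minimal sketch of computing such a profile from an address trace (illustrative only; production tools use tree-based O(N log M) algorithms):

```python
# Sketch of a reuse-distance profile computed from a memory address trace.
from collections import OrderedDict

def reuse_distance_profile(trace):
    """Return a histogram {distance: count} for an address trace.
    O(N * M) for clarity; production tools use O(N log M) trees."""
    stack = OrderedDict()          # LRU stack: most recent access last
    histogram = {}
    for addr in trace:
        if addr in stack:
            # Distance = number of distinct addresses accessed since the
            # previous touch of addr (elements above it on the stack).
            keys = list(stack)
            dist = len(keys) - keys.index(addr) - 1
            del stack[addr]
        else:
            dist = float("inf")    # cold miss: never seen before
        stack[addr] = True
        histogram[dist] = histogram.get(dist, 0) + 1
    return histogram

print(reuse_distance_profile(["a", "b", "c", "a", "b", "b"]))
# {inf: 3, 2: 2, 0: 1} -> accesses with distance < cache capacity would hit
```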

    Multi-criteria Hardware Trojan Detection: A Reinforcement Learning Approach

    Hardware Trojans (HTs) are undesired design or manufacturing modifications that can severely alter the security and functionality of digital integrated circuits. HTs can be inserted according to various design criteria, e.g., net switching activity, observability, and controllability. To our knowledge, however, most HT detection methods are based on a single criterion, namely net switching activity. This paper proposes a multi-criteria reinforcement learning (RL) HT detection tool that features a tunable reward function for different HT detection scenarios. The tool allows for exploring existing detection strategies and can adapt to new detection scenarios with minimal effort. We also propose a generic methodology for comparing HT detection methods fairly. Our preliminary results show an average of 84.2% successful HT detection on the ISCAS-85 benchmarks.
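
    One way to picture a tunable multi-criteria reward is as a weighted combination of per-net criteria, where re-weighting retargets the agent to a different detection scenario. The sketch below is illustrative only; the criteria names and weighting scheme are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a tunable multi-criteria reward for HT detection.
from dataclasses import dataclass

@dataclass
class NetFeatures:
    switching_activity: float  # normalized transition probability
    observability: float       # normalized testability metric (assumed)
    controllability: float     # normalized testability metric (assumed)

def reward(net: NetFeatures, w_sw=1.0, w_obs=0.0, w_ctrl=0.0):
    """Rarely switching, hard-to-test nets are likelier Trojan trigger
    sites, so rarity is rewarded. Re-weighting (w_sw, w_obs, w_ctrl)
    retargets the agent to a different detection scenario."""
    return (w_sw * (1.0 - net.switching_activity)
            + w_obs * (1.0 - net.observability)
            + w_ctrl * (1.0 - net.controllability))

suspect = NetFeatures(switching_activity=0.02, observability=0.1,
                      controllability=0.3)
print(reward(suspect, w_sw=0.5, w_obs=0.3, w_ctrl=0.2))
```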

    Predicting Expressibility of Parameterized Quantum Circuits using Graph Neural Network

    Parameterized Quantum Circuits (PQCs) are essential to quantum machine learning and optimization algorithms. The expressibility of a PQC, which measures its ability to represent a wide range of quantum states, is a critical factor in its efficacy for solving quantum problems. However, the existing technique for computing expressibility relies on statistical estimation through classical simulation, which requires many samples. In this work, we propose a novel method based on Graph Neural Networks (GNNs) for predicting the expressibility of PQCs. By leveraging a graph-based representation of PQCs, our GNN model captures intricate relationships between circuit parameters and the resulting expressibility. We train the GNN model on a comprehensive dataset of PQCs annotated with their expressibility values. Experimental evaluation on a dataset of four thousand random PQCs and on IBM Qiskit's hardware-efficient ansatz sets demonstrates the superior performance of our approach, achieving root mean square errors (RMSE) of 0.03 and 0.06, respectively.
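
    The sampling-based estimate that the GNN replaces follows the standard recipe: draw random parameter pairs, compute state fidelities, and measure the KL divergence of their distribution from the Haar fidelity distribution P(F) = (N-1)(1-F)^(N-2). A minimal sketch, where `statevector` is a hypothetical simulator hook rather than a real API:

```python
# Sketch of sampling-based expressibility estimation; `statevector` is a
# hypothetical callable mapping a parameter vector to a complex state.
import numpy as np

def expressibility(statevector, n_params, n_qubits, samples=5000, bins=75):
    dim = 2 ** n_qubits
    fids = np.empty(samples)
    for i in range(samples):
        psi = statevector(np.random.uniform(0, 2 * np.pi, n_params))
        phi = statevector(np.random.uniform(0, 2 * np.pi, n_params))
        fids[i] = abs(np.vdot(psi, phi)) ** 2       # state fidelity
    hist, edges = np.histogram(fids, bins=bins, range=(0, 1))
    p = hist / samples                              # empirical P(F)
    centers = (edges[:-1] + edges[1:]) / 2
    q = (dim - 1) * (1 - centers) ** (dim - 2)      # Haar P(F) at bin centers
    q = q / q.sum()                                 # normalize over bins
    mask = (p > 0) & (q > 0)
    # KL(P || Q): lower divergence from Haar means higher expressibility.
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
```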

    Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics

    The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation, which allows compute and memory resources to be allocated individually to each workload as required. However, it is unclear how to realize this capability efficiently and cost-effectively while meeting the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonics can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation that fully meets the bit error rate (BER) and high escape-bandwidth requirements of all chip types in modern HPC racks. Our photonics-based disaggregated rack provides an average application speedup of 11% (46% maximum) for 25 CPU benchmarks and 61% for 24 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline.

    Comment: 15 pages, 12 figures, 4 tables. Published in IEEE Cluster 2023.
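
    The intuition behind the hardware savings is a pooling argument: a non-disaggregated rack provisions every node for its own peak demand, while a disaggregated rack provisions for the peak of the rack-wide sum, which is smaller when per-node peaks do not coincide. A back-of-the-envelope sketch with synthetic demand traces (not the paper's production data):

```python
# Synthetic illustration of pooled vs. per-node resource provisioning.
import random

nodes, steps = 16, 1000
demand = [[random.uniform(0.0, 1.0) for _ in range(steps)]
          for _ in range(nodes)]  # per-node memory demand over time (synthetic)

# Static: each node carries enough modules for its own peak.
per_node_peak = sum(max(trace) for trace in demand)
# Disaggregated: the rack pool covers the peak of the summed demand.
pooled_peak = max(sum(trace[t] for trace in demand) for t in range(steps))

print(f"static provisioning: {per_node_peak:.1f} memory-module units")
print(f"pooled provisioning: {pooled_peak:.1f} memory-module units")
```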

    BB-ML: Basic Block Performance Prediction using Machine Learning Techniques

    Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML for performance prediction at a much finer granularity, namely at the Basic Block (BB) level; a basic block is a single-entry, single-exit code block that compilers use to break a large code into manageable pieces for analysis. We extrapolate the basic block execution counts of GPU applications, predicting the counts for large input sizes from those of smaller input sizes. We train a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model accurately predicts the basic block execution counts of 16 GPU benchmarks: we achieve 93.5% accuracy when extrapolating basic block counts for large input sets after training on smaller input sets, and 97.7% accuracy when predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications, evaluating with a variety of metrics including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability to predict the performance of large datasets with average error rates of 0.85% and 0.17% for global and shared memory requests, respectively. Additionally, to address the utilization of the main functional units in Ampere architecture GPUs, we calculate the active cycles for tensor core, ALU, FMA, and FP64 units, achieving average errors of 2.3% and 10.66% for the ALU and FMA units, while the maximum observed error across all tested applications and units reaches 18.5%.

    Comment: Accepted at the 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2023).
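
    As a deliberately simplified stand-in for the paper's Poisson Neural Network, the sketch below fits a log-linear count model to small-input basic block counts and extrapolates to a large input; for the large counts typical of BB profiles, such a fit closely tracks the Poisson maximum-likelihood solution. All numbers are synthetic.

```python
# Simplified stand-in for PNN-style basic block count extrapolation.
import numpy as np

def fit_log_linear(sizes, counts):
    """Fit log(count) = a * log(size) + b by least squares; for large
    counts this approximates the Poisson MLE a PNN would optimize."""
    a, b = np.polyfit(np.log(sizes), np.log(counts), 1)
    return a, b

sizes = np.array([64, 128, 256, 512])      # small training input sizes
counts = np.array([130, 260, 515, 1030])   # synthetic BB execution counts
a, b = fit_log_linear(sizes, counts)
pred = np.exp(a * np.log(4096) + b)
print(f"extrapolated count at input 4096: {pred:.0f}")  # ~2x input size here
```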